{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# DSPy with Weaviate + LangWatch DSPy Visualizer\n",
        "\n",
        "This notebook shows an example of DSPy RAG program using Weaviate as the vector database and LangWatch for visualization of the DSPy optimization process."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Pgy1Fjhh_lOB"
      },
      "outputs": [],
      "source": [
        "# Install weaviate and dspy along with langwatch for the visualization\n",
        "%pip install weaviate-client \"dspy-ai[weaviate]\" langwatch"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 1. Load Data into Weaviate\n",
        "\n",
        "You need a running Weaviate cluster with data:\n",
        "\n",
        "1. Learn about the installation options here\n",
        "2. Import your data:\n",
        "\n",
        "    a. You can follow the [Weaviate-Import.ipynb](https://github.com/weaviate/recipes/blob/main/integrations/llm-frameworks/dspy/Weaviate-Import.ipynb) notebook to load in the Weaviate blogs\n",
        "  \n",
        "    b. Or follow this [Quickstart Guide](https://weaviate.io/developers/weaviate/quickstart)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "51OWavv1CCVV"
      },
      "source": [
        "## 2. Prepare the LLM and Retriever"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "xycw8IWs_qnt",
        "outputId": "40844780-608a-4162-cac7-35f57c5764f1"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "LLM test response: ['Hello! How can I assist you today?']\n",
            "Retriever test response: {'long_text': \"LLMs are a versatile tool that is seen in many applications like chatbots, content creation, and much more. Despite being a powerful tool, LLMs have the drawback of being too general. Reasoning: Let's think step by step in order to **produce the query. We need to identify the unique aspects of the document that would allow us to formulate a question that this document can answer. The document seems to focus on the combination of LangChain and Weaviate, mentioning the benefits of LangChain in overcoming limitations of LLMs such as hallucination and limited input lengths.\"}\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n",
            "I0000 00:00:1721929852.765644 4075415 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache\n"
          ]
        }
      ],
      "source": [
        "import os\n",
        "from getpass import getpass\n",
        "\n",
        "os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter your OPENAI_API_KEY: \")\n",
        "\n",
        "import dspy\n",
        "from dspy.retrieve.weaviate_rm import WeaviateRM\n",
        "import weaviate\n",
        "\n",
        "llm = dspy.OpenAI(\n",
        "    model=\"gpt-4o-mini\",\n",
        "    max_tokens=4096,\n",
        "    temperature=0,\n",
        "    api_key=os.environ[\"OPENAI_API_KEY\"]\n",
        ")\n",
        "\n",
        "print(\"LLM test response:\", llm(\"hello there\"))\n",
        "\n",
        "weaviate_client = weaviate.connect_to_local()\n",
        "retriever_model = WeaviateRM(\"WeaviateBlogChunk\", weaviate_client=weaviate_client)\n",
        "\n",
        "print(\"Retriever test response:\", retriever_model(\"LLMs\")[0])\n",
        "\n",
        "dspy.settings.configure(lm=llm, rm=retriever_model)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YIAYLNlcCFdO"
      },
      "source": [
        "## 3. Prepare the Dataset"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "(30, 20)"
            ]
          },
          "execution_count": 2,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import httpx\n",
        "\n",
        "dataset = httpx.get(\"https://raw.githubusercontent.com/weaviate/recipes/main/integrations/llm-frameworks/dspy/WeaviateBlogRAG-0-0-0.json\").json()\n",
        "\n",
        "gold_answers = []\n",
        "queries = []\n",
        "\n",
        "for row in dataset:\n",
        "    gold_answers.append(row[\"gold_answer\"])\n",
        "    queries.append(row[\"query\"])\n",
        "\n",
        "data = []\n",
        "\n",
        "for i in range(len(gold_answers)):\n",
        "    data.append(dspy.Example(gold_answer=gold_answers[i], question=queries[i]).with_inputs(\"question\"))\n",
        "\n",
        "trainset, devset = data[:30], data[30:50]\n",
        "\n",
        "len(trainset), len(devset)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KOXtqnmfCNzS"
      },
      "source": [
        "## 4. Define the RAG model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "WxAaf1IABgxM",
        "outputId": "03155b63-bbaf-4ceb-bf26-07cda4f52b6f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "[Devset] Question: What is the strategy for chunking text for vectorization when dealing with a 512 token length limit?\n",
            "[Devset] Answer: The strategy for chunking text for vectorization when dealing with a 512 token length limit involves using a Large Language Model to identify suitable places to cut up text chunks. This process, known as \"chunking\", breaks down long documents into smaller sections, each containing an important piece of information. This approach not only helps to stay within the LLMs token limit but also enhances the retrieval of information. It's important to note that the chunking should be done thoughtfully, not just splitting a list of items into 2 chunks because the first half fell into the tail end of a chunk[:512] loop.\n",
            "[Prediction] Question: What is the strategy for chunking text for vectorization when dealing with a 512 token length limit?\n",
            "[Prediction] Predicted Answer: Context: The provided context discusses the limitations of vector embedding models, particularly the 512 token length constraint, and suggests strategies for effectively chunking text for vectorization. It highlights the importance of identifying optimal places to cut text chunks, such as avoiding arbitrary splits that could disrupt the coherence of the content, and emphasizes the use of summarization techniques to handle longer documents.\n",
            "\n",
            "Question: What is the strategy for chunking text for vectorization when dealing with a 512 token length limit?\n",
            "\n",
            "Answer: The strategy for chunking text for vectorization within the 512 token length limit involves several key considerations. First, it is essential to identify logical and coherent points in the text where it can be divided, rather than making arbitrary cuts that could disrupt the flow of information. For instance, when dealing with lists or structured data, care should be taken to avoid splitting items in a way that separates related content. \n",
            "\n",
            "Additionally, for longer documents that exceed the token limit, summarization techniques can be employed. This involves creating a summary of the text, which can then be vectorized and indexed for semantic search. By summarizing the content, the overall length is reduced while still capturing the essential information, making it suitable for embedding within the token constraints.\n",
            "\n",
            "In practice, a good chunk size is often around a single paragraph or less, which typically fits within the 512 token limit. However, in special cases where longer texts need to be embedded, models with larger context windows may be necessary. Overall, the strategy emphasizes coherence, logical segmentation, and the use of summarization to effectively manage the limitations of vectorization models.\n"
          ]
        }
      ],
      "source": [
        "class GenerateAnswer(dspy.Signature):\n",
        "    \"\"\"Assess the the context and answer the question.\"\"\"\n",
        "\n",
        "    context = dspy.InputField(desc=\"Helpful information for answering the question.\")\n",
        "    question = dspy.InputField()\n",
        "    answer = dspy.OutputField(desc=\"A detailed answer that is supported by the context.\")\n",
        "\n",
        "\n",
        "class RAG(dspy.Module):\n",
        "    def __init__(self, k=3):\n",
        "        super().__init__()\n",
        "\n",
        "        self.retrieve = dspy.Retrieve(k=k)\n",
        "        self.generate_answer = dspy.Predict(GenerateAnswer)\n",
        "\n",
        "    def forward(self, question):\n",
        "        context = self.retrieve(question).passages\n",
        "        pred = self.generate_answer(context=context, question=question).answer\n",
        "        return dspy.Prediction(context=context, answer=pred, question=question)\n",
        "\n",
        "\n",
        "dev_example = devset[0]\n",
        "print(f\"[Devset] Question: {dev_example.question}\")\n",
        "print(f\"[Devset] Answer: {dev_example.gold_answer}\")\n",
        "\n",
        "generate_answer = RAG()\n",
        "\n",
        "pred = generate_answer(question=dev_example.question)\n",
        "\n",
        "# Print the input and the prediction.\n",
        "print(f\"[Prediction] Question: {dev_example.question}\")\n",
        "print(f\"[Prediction] Predicted Answer: {pred.answer}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 5. Define your Metric"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 110,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "422.5"
            ]
          },
          "execution_count": 110,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "class TypedEvaluator(dspy.Signature):\n",
        "    \"\"\"Evaluate the quality of a system's answer to a question according to a given criterion.\n",
        "    Please be a bit harsh, only give a 5 to a truly above and beyond answer.\n",
        "    \"\"\"\n",
        "\n",
        "    criterion: str = dspy.InputField(desc=\"The evaluation criterion.\")\n",
        "    question: str = dspy.InputField(desc=\"The question asked to the system.\")\n",
        "    ground_truth_answer: str = dspy.InputField(desc=\"An expert written Ground Truth Answer to the question.\")\n",
        "    predicted_answer: str = dspy.InputField(desc=\"The system's answer to the question.\")\n",
        "    rating: float = dspy.OutputField(desc=\"A float rating between 1 and 5\")\n",
        "\n",
        "\n",
        "def MetricWrapper(gold, pred, trace=None):\n",
        "    alignment_criterion = \"How aligned is the predicted_answer with the ground_truth?\"\n",
        "    return dspy.TypedPredictor(TypedEvaluator)(criterion=alignment_criterion,\n",
        "                                          question=gold.question,\n",
        "                                          ground_truth_answer=gold.gold_answer,\n",
        "                                          predicted_answer=pred.answer).rating\n",
        "\n",
        "from dspy.evaluate.evaluate import Evaluate\n",
        "\n",
        "evaluate = Evaluate(devset=devset, num_threads=4, display_progerss=False)\n",
        "\n",
        "uncomplied_score = evaluate(RAG(), metric=MetricWrapper)\n",
        "uncomplied_score"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ytbRU9jJCSj8"
      },
      "source": [
        "## 6. Connect to LangWatch"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "lF9DxTGeCU15",
        "outputId": "29ebbacf-9ca8-4322-fd77-32e47f4c93aa"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Please go to https://app.langwatch.ai/authorize to get your API key\n",
            "LangWatch API key set\n"
          ]
        }
      ],
      "source": [
        "import langwatch\n",
        "\n",
        "langwatch.login()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "o69S-BlkE-bV"
      },
      "source": [
        "## 7. Start Training Session!\n",
        "\n",
        "This will cost around $0.40"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Ef67q2B-FCIP",
        "outputId": "cefe1935-1ef3-46ad-a54c-74f08bfb75f0"
      },
      "outputs": [],
      "source": [
        "from dspy.teleprompt import MIPROv2\n",
        "import dspy.evaluate\n",
        "\n",
        "# use gpt-4o as the prompt model to teach gpt4-mini\n",
        "teacher = dspy.OpenAI(\n",
        "    model=\"gpt-4o\", max_tokens=4096, temperature=0, api_key=os.environ[\"OPENAI_API_KEY\"]\n",
        ")\n",
        "\n",
        "# Set up a MIPROv2 optimizer, which will compile our RAG program.\n",
        "optimizer = MIPROv2(\n",
        "    metric=MetricWrapper,\n",
        "    prompt_model=teacher,\n",
        "    task_model=llm,\n",
        "    num_candidates=3,\n",
        "    init_temperature=0.7,\n",
        ")\n",
        "\n",
        "# Initialize langwatch for this run, to track the optimizer compilation\n",
        "langwatch.dspy.init(experiment=\"weaviate-blog-rag-experiment\", optimizer=optimizer)\n",
        "\n",
        "# Compile\n",
        "compiled_rag = optimizer.compile(\n",
        "    RAG(),\n",
        "    trainset=trainset,\n",
        "    num_batches=10,\n",
        "    max_bootstrapped_demos=5,\n",
        "    max_labeled_demos=5,\n",
        "    eval_kwargs=dict(num_threads=16, display_progress=True, display_table=0),\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Screenshot:\n",
        "\n",
        "![optimization screenshot](./weaviate_setup/optimization_screenshot.png)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 115,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "430.0\n",
            "Congratulations! We optimized the RAG program and bumped the score from 422.5 to 430.0!\n"
          ]
        }
      ],
      "source": [
        "complied_score = evaluate(compiled_rag, metric=MetricWrapper)\n",
        "print(complied_score)\n",
        "\n",
        "print(f\"Congratulations! We optimized the RAG program and bumped the score from {uncomplied_score} to {complied_score}!\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 8. Save the optimized RAG\n",
        "\n",
        "You can now use your optimized RAG program for inference"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "u5vA80_JJX-q"
      },
      "outputs": [],
      "source": [
        "compiled_rag.save(\"optimized_rag.json\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 9. Final Step: Instrument your DSPy program for Production\n",
        "\n",
        "Now that you have your optimized RAG, you are ready to deploy, but usage in production can also be unpredictable.\n",
        "\n",
        "To keep track of which documents are being retrieved and used by your RAG in production, you can use the `langwatch.trace()` decorator to instrument your DSPy program ([docs for more details](https://docs.langwatch.ai/integration/python/guide#capturing-llm-spans))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Trace Public URL: https://app.langwatch.ai/share/iNwGfWN3E3EzkoNjQxwK1\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "Prediction(\n",
              "    context=['We can then vectorize this text description using off-the-shelf models from OpenAI, Cohere, HuggingFace, and others to unlock semantic search. We recently presented an example of this idea for [AirBnB listings](https://weaviate.io/blog/generative-feedback-loops-with-llms), translating tabular data about each property’s price, neighborhood, and more into a text description. Huge thanks to Svitlana Smolianova for creating the following animation of the concept. <img\\n    src={require(\\'./img/gen-example.gif\\').default}\\n    alt=\"alt\"\\n    style={{ width: \"100%\" }}\\n/>\\n\\n### Text Chunking\\nSimilarly related to the 512 token length for vectorizing text chunks, we may consider using the Large Language Model to identify good places to cut up text chunks. For example, if we have a list of items, it might not be best practice to separate the list into 2 chunks because the first half fell into the tail end of a chunk[:512] loop.', '### Summarization Indexes\\nVector embedding models are typically limited to 512 tokens at a time. This limits the application of vector embeddings to represent long documents such as an entire chapter of a book, or the book itself. An emerging solution is to apply summarization chains such as the create-and-refine example animated at the end of this article to summarize long documents. We then take the summary, vectorize it, and build an index to apply semantic search to the comparison of long documents such as books or podcast transcripts. ### Extracting Structured Data\\nThe next application of LLMs in Search Index Construction is in extracting structured data from unstructured text chunks.', 'For common RAG applications, a good chunk size for an embedding is typically about a single paragraph of text or less. In this case, models with max tokens of 512 should be sufficient. However, there will be special cases where you need to embed longer source texts that require models with a larger context window. ![mteb leaderboard](./img/mteb.png)\\n\\n## Step 3: Evaluate the model on your use case\\n\\nWhile the MTEB Leaderboard is a great place to start, you should take its results cautiously and skeptically. Bear in mind that these results are self-reported.'],\n",
              "    answer='Context: The context discusses the limitations of vector embedding models, which are typically restricted to processing 512 tokens at a time. It emphasizes the importance of effective text chunking strategies to optimize the vectorization process, particularly when dealing with long documents or extensive text data. The context also mentions the use of Large Language Models (LLMs) to identify optimal places for cutting text into chunks, ensuring that the chunks remain coherent and meaningful for vectorization.\\n\\nQuestion: What is the strategy for chunking text for vectorization when dealing with a 512 token length limit?\\n\\nAnswer: To effectively chunk text for vectorization within the 512 token length limit, you can follow these strategies:\\n\\n1. **Identify Natural Breaks**: Use the content structure to identify natural breaks in the text, such as paragraphs, sentences, or sections. This helps maintain the coherence of the information within each chunk.\\n\\n2. **Use Large Language Models (LLMs)**: Leverage LLMs to analyze the text and suggest optimal chunking points. These models can help determine where to split the text based on semantic meaning, ensuring that each chunk remains contextually relevant.\\n\\n3. **Limit Chunk Size**: Aim to keep each chunk to a single paragraph or less, as this typically fits well within the 512 token limit. This approach allows for more manageable and meaningful vector representations.\\n\\n4. **Summarization for Long Texts**: For longer documents that exceed the token limit, consider applying summarization techniques. Summarize the content first, then vectorize the summary to create an index for semantic search.\\n\\n5. **Iterative Testing**: After chunking, test the vectorization process to ensure that the resulting embeddings are effective for your specific use case. Adjust the chunk sizes as necessary based on the performance of the semantic search.\\n\\nBy implementing these strategies, you can optimize the chunking of text for vectorization, ensuring that the resulting embeddings are both efficient and meaningful for subsequent semantic search tasks.',\n",
              "    question='What is the strategy for chunking text for vectorization when dealing with a 512 token length limit?'\n",
              ")"
            ]
          },
          "execution_count": 5,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import langwatch\n",
        "\n",
        "compiled_rag = RAG()\n",
        "compiled_rag.load(\"optimized_rag.json\")\n",
        "\n",
        "@langwatch.trace()\n",
        "def generate_response(question: str):\n",
        "  langwatch.get_current_trace().autotrack_dspy()\n",
        "  public_url = langwatch.get_current_trace().share()\n",
        "  print(f\"Trace Public URL: {public_url}\")\n",
        "\n",
        "  return compiled_rag(question=question)\n",
        "\n",
        "\n",
        "generate_response(dev_example.question)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Screenshot:\n",
        "\n",
        "![Tracing Screenshot](./weaviate_setup/tracing_screenshot.png)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.8"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
